In [2]:
import pandas as pd

This notebook intends to explore and describe the dataset curated of fusion peptides used in this study. The column 'seq_vfp' has the fusion peptide and the column 'Sequence_fusogenic' contains the larger fusion protein.

In [4]:
path = 'data/viral_fp_curated.csv'
vfp_curated = pd.read_csv(path)
In [3]:
vfp_curated
Out[3]:
idProtein Name Class Activation Name_Fusogenic_Unit Location_Fusogenic Sequence_fusogenic UniProtID NcbiID idTaxonomy ... idTaxonomy.1 CommonName Family Genre Species SubSpecies NcbiTax FP corrected DL Notes DL seq_vfp
0 2 Envelope glycoprotein I binding to receptor Envelope protein p15E 471-650 EPVSLTLALLLGGLTMGGIAAGVGTGTTALVATQQFQQLQAAMHDD... P03386 NaN 2 ... 2 AKV murine leukemia virus (AKR (endogenous) mu... Retroviridae Gammaretrovirus AKR murine leukemia virus NaN 11791 NaN NaN VSLTLALLLGGLTMGGIAAGV
1 4 Envelope glycoprotein gp95 I binding to receptor gp37 402-606 STSHLDDTCSDEVQLWGPTARIFASILAPGVAAAQALREIERLACW... P03397 NaN 4 ... 4 Avian leukosis virus RSA (RSV-SRA) (Rous sarco... Retroviridae Alpharetrovirus Rous sarcoma virus strain Schmidt-Ruppin A 269446 NaN NaN GPTARIFASILAPGVAAAQAL
2 5 Fusion glycoprotein F0 I binding to receptor F1 103-537 FVLGAIALGVATAAAVTAGVAIAKTIRLEGEVAAIKGALRKTNEAV... Q2Y2M3 NaN 5 ... 5 Avian metapneumovirus (isolate Canada goose/Mi... Pneumoviridae Metapneumovirus Avian metapneumovirus 15a 652954 NaN NaN FVLGAIALGVATAAAVTAGVAI
3 6 Envelope glycoprotein I binding to receptor gp22 393-582 AVQFIPLLVGLGITGATLAGGTGLGVSVHTYHKLSNQLIEDVQALS... P03399 NaN 6 ... 6 Avian reticuloendotheliosis virus Retroviridae Gammaretrovirus Reticuloendotheliosis virus NaN 11636 NaN NaN FIPLLVGLGITGATLAGGTGL
4 7 Envelope glycoprotein I binding to receptor gp37 79-257 SVSHLDDTCSDEVQLWGPTARIFASILAPGVAAAQALREIERLACW... P33498 NaN 7 ... 7 Avian retrovirus RPL30 Retroviridae Alpharetrovirus Avian leukosis virus RPL30 31671 NaN NaN GPTARIFASILAPGVAAAQAL
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
398 798 Genome polyprotein NaN binding to receptor Envelop protein E 286 - 778 AHCIGVADRDFIEGVHGGTWVSATLEQDKCVTVMAPDKPSLDISLE... Q9YRV3 NaN 798 ... 798 Yellow fever virus Flaviviridae Flavivirus Yellow fever virus Trinidad/79A/1979 407137 NaN NaN DRGWGNGCGLFGKG
399 799 Genome polyprotein NaN binding to receptor Envelop protein E 286 - 778 AHCIGITDRDFIEGVHGGTWVSASLEQDKCVTVMAPDKPSLDISLQ... Q1X880 NaN 799 ... 799 Yellow fever virus Flaviviridae Flavivirus Yellow fever virus isolate Uganda/A7094A4/1948 407139 NaN NaN DRGWGNGCGLFGKG
400 800 Genome polyprotein NaN binding to receptor Envelop protein E 286 - 778 AHCIGITDRDFIEGVHGGTWVSATLEQGKCVTVMAPDKPSLDISLQ... Q074N0 NaN 800 ... 800 Yellow fever virus Flaviviridae Flavivirus Yellow fever virus isolate Ethiopia/Couma/1961 407141 NaN NaN DRGWGNGCGLFGKG
401 801 Genome polyprotein NaN binding to receptor Envelop protein E 268 - 778 AHCIGITDRDFIEGVHGGTWVSATLEQDKCVTVMAPDKPSLDISLE... Q89277 NaN 801 ... 801 Yellow fever virus Flaviviridae Flavivirus Yellow fever virus strain French neurotropic vaccine 407135 NaN NaN DRGWGNGCGLFGKG
402 802 Genome polyprotein NaN binding to receptor Envelop protein E 291 - 790 IRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELV... Q32ZE1 NaN 802 ... 802 Zika virus Flaviviridae Flavivirus Zika virus NaN 64320 NaN NaN DRGWGNGCGLFGKGSL

403 rows × 26 columns

Repeats on fusion peptides¶

How many fusion peptide identical in the dataset. The fusion peptides identical were not removed as they may have different fusion protein associated.

In [4]:
dup = vfp_curated.duplicated(subset=['seq_vfp'], keep = False)
vfp_curated[dup]
Out[4]:
idProtein Name Class Activation Name_Fusogenic_Unit Location_Fusogenic Sequence_fusogenic UniProtID NcbiID idTaxonomy ... idTaxonomy.1 CommonName Family Genre Species SubSpecies NcbiTax FP corrected DL Notes DL seq_vfp
0 2 Envelope glycoprotein I binding to receptor Envelope protein p15E 471-650 EPVSLTLALLLGGLTMGGIAAGVGTGTTALVATQQFQQLQAAMHDD... P03386 NaN 2 ... 2 AKV murine leukemia virus (AKR (endogenous) mu... Retroviridae Gammaretrovirus AKR murine leukemia virus NaN 11791 NaN NaN VSLTLALLLGGLTMGGIAAGV
1 4 Envelope glycoprotein gp95 I binding to receptor gp37 402-606 STSHLDDTCSDEVQLWGPTARIFASILAPGVAAAQALREIERLACW... P03397 NaN 4 ... 4 Avian leukosis virus RSA (RSV-SRA) (Rous sarco... Retroviridae Alpharetrovirus Rous sarcoma virus strain Schmidt-Ruppin A 269446 NaN NaN GPTARIFASILAPGVAAAQAL
2 5 Fusion glycoprotein F0 I binding to receptor F1 103-537 FVLGAIALGVATAAAVTAGVAIAKTIRLEGEVAAIKGALRKTNEAV... Q2Y2M3 NaN 5 ... 5 Avian metapneumovirus (isolate Canada goose/Mi... Pneumoviridae Metapneumovirus Avian metapneumovirus 15a 652954 NaN NaN FVLGAIALGVATAAAVTAGVAI
4 7 Envelope glycoprotein I binding to receptor gp37 79-257 SVSHLDDTCSDEVQLWGPTARIFASILAPGVAAAQALREIERLACW... P33498 NaN 7 ... 7 Avian retrovirus RPL30 Retroviridae Alpharetrovirus Avian leukosis virus RPL30 31671 NaN NaN GPTARIFASILAPGVAAAQAL
7 11 Spike glycoprotein I binding to receptors S2 750-1235 SYSASQFQLAVLNYTSPIVVTPINSSGFTAAIPTNFSFSLTQEYIE... Q0Q4F2 NaN 11 ... 11 Bat coronavirus 133/2005 (BtCoV) (BtCoV/133/2005) Coronaviridae Betacoronavirus Tylonycteris bat coronavirus HKU4 BtCoV/133/2005 389230 NaN NaN SSLLGSIAGAGWTAGLSSFAAIPFA
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
397 797 Genome polyprotein NaN binding to receptor Envelop protein E 286 - 778 AHCIGITDRDFIEGVHGGTWVSATLEQDKCVTVMAPDKPSLDISLQ... Q1X881 NaN 797 ... 797 Yellow fever virus Flaviviridae Flavivirus Yellow fever virus isolate Angola/14FA/1971 407140 NaN NaN DRGWGNGCGLFGKG
398 798 Genome polyprotein NaN binding to receptor Envelop protein E 286 - 778 AHCIGVADRDFIEGVHGGTWVSATLEQDKCVTVMAPDKPSLDISLE... Q9YRV3 NaN 798 ... 798 Yellow fever virus Flaviviridae Flavivirus Yellow fever virus Trinidad/79A/1979 407137 NaN NaN DRGWGNGCGLFGKG
399 799 Genome polyprotein NaN binding to receptor Envelop protein E 286 - 778 AHCIGITDRDFIEGVHGGTWVSASLEQDKCVTVMAPDKPSLDISLQ... Q1X880 NaN 799 ... 799 Yellow fever virus Flaviviridae Flavivirus Yellow fever virus isolate Uganda/A7094A4/1948 407139 NaN NaN DRGWGNGCGLFGKG
400 800 Genome polyprotein NaN binding to receptor Envelop protein E 286 - 778 AHCIGITDRDFIEGVHGGTWVSATLEQGKCVTVMAPDKPSLDISLQ... Q074N0 NaN 800 ... 800 Yellow fever virus Flaviviridae Flavivirus Yellow fever virus isolate Ethiopia/Couma/1961 407141 NaN NaN DRGWGNGCGLFGKG
401 801 Genome polyprotein NaN binding to receptor Envelop protein E 268 - 778 AHCIGITDRDFIEGVHGGTWVSATLEQDKCVTVMAPDKPSLDISLE... Q89277 NaN 801 ... 801 Yellow fever virus Flaviviridae Flavivirus Yellow fever virus strain French neurotropic vaccine 407135 NaN NaN DRGWGNGCGLFGKG

257 rows × 26 columns

In [5]:
counts = vfp_curated['seq_vfp'].value_counts()
counts
Out[5]:
VYPFMWGGAYCFCDTENT                  12
VSLTLALLLGGLTMGGIAAGV               11
DRGWGNGCGLFGKG                      11
AVGIGALFLGFLGAAGSTMGA                9
GVFVLGFLGFLATAGSAMGAA                7
                                    ..
VVVSATLCSALYVGDLCGGVMLAAQMFIVSPQ     1
MVGAAALCSAMYLGDLCGGVFLVGQLFTFRPR     1
LAATSASLFPPWTAAAGVPFY                1
KASSRSAIEDLLFDKV                     1
DRGWGNGCGLFGKGSL                     1
Name: seq_vfp, Length: 216, dtype: int64
In [6]:
counts[counts>1]
Out[6]:
VYPFMWGGAYCFCDTENT           12
VSLTLALLLGGLTMGGIAAGV        11
DRGWGNGCGLFGKG               11
AVGIGALFLGFLGAAGSTMGA         9
GVFVLGFLGFLATAGSAMGAA         7
                             ..
FIGAIIGGVALGVATAAQITAAAAL     2
VYPLLWGAAHCFCSTENT            2
WGCNPSDCPGVGTGCTACGLYLD       2
GVGLVIMLVIMAIVAAAGASL         2
LIGAIIGGVALGVATAAQITAASAL     2
Name: seq_vfp, Length: 70, dtype: int64

We can see that there are 216 different fusion peptides and 70 of them have more than one occurrence.

Dataset entries without fusion protein¶

In [7]:
l_or = len(vfp_curated)
l_na_protein = len(vfp_curated.dropna(subset=['Sequence_fusogenic']))

print('The dataset origina has {} entries'.format(l_or))
print('This dataset contains {} entries of fusion peptides with fusion protein'.format(l_na_protein))
print('This dataset contains {} entries of fusion peptides without fusion protein'.format(l_or - l_na_protein))

vfp_uni = vfp_curated.drop_duplicates(subset=['seq_vfp'])
l_or_uni = len(vfp_uni)
l_na_protein_uni = len(vfp_uni.dropna(subset=['Sequence_fusogenic']))

print('######')
print('The dataset with just one entry per fusion peptide contains {} entries'.format(l_or_uni))
print('This dataset contains {} entries of fusion peptides with fusion protein'.format(l_na_protein_uni))
print('This dataset contains {} entries of fusion peptides without fusion protein'.format(l_or_uni - l_na_protein_uni))
The dataset origina has 403 entries
This dataset contains 379 entries of fusion peptides with fusion protein
This dataset contains 24 entries of fusion peptides without fusion protein
######
The dataset with just one entry per fusion peptide contains 216 entries
This dataset contains 192 entries of fusion peptides with fusion protein
This dataset contains 24 entries of fusion peptides without fusion protein

There are 379 entries of fusion peptides with fusion peptides. 216 if considering only fusion peptides unique. There are 24 entries without fusion protein.

sizes of fusion proteins and fusion peptide¶

In [8]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(2,2, figsize=(20, 15))

plt.subplot(221)
fp_len = vfp_curated['seq_vfp'].str.len()
plt.hist(fp_len)
plt.title('len of fusion peptides w/ dup')

plt.subplot(222)
fp_len = vfp_curated['Sequence_fusogenic'].str.len()
plt.hist(fp_len)
plt.title('len of fusion proteins w/ dup')

plt.subplot(223)
fp_len = vfp_uni['seq_vfp'].str.len()
plt.hist(fp_len)
plt.title('len of fusion peptides w/o dup')

plt.subplot(224)
fp_len = vfp_uni['Sequence_fusogenic'].str.len()
plt.hist(fp_len)
plt.title('len of fusion proteins w/o dup')

plt.subplots_adjust( hspace=0.2, wspace=0.2)

plt.show()
No description has been provided for this image

Get the sequences above the 35 aminoacids as it is not very common

In [9]:
vfp_curated[vfp_curated['seq_vfp'].str.len()>35]
Out[9]:
idProtein Name Class Activation Name_Fusogenic_Unit Location_Fusogenic Sequence_fusogenic UniProtID NcbiID idTaxonomy ... idTaxonomy.1 CommonName Family Genre Species SubSpecies NcbiTax FP corrected DL Notes DL seq_vfp
13 17 Envelope glycoprotein p57 II ph Envelope glycoprotein p29 250-503 DTQQIEYLVHKLRPTLKDAWEDCEILQSLLLGVFGTGIASASQFLR... P52638 NaN 17 ... 17 Borna disease virus (strain V) (BDV) Bornaviridae NaN Borna disease virus strain V 928296 NaN NaN ILQSLLLGVFGTGIASASQFLRSWLNHPDIIGYIVNGVGVVW
14 18 Envelope glycoprotein p57 II ph Envelope glycoprotein p29 250-503 DTQQIEYLVHKLRPTLKDAWEDCEILQSLLLGVFGTGIASASQFLR... Q8BB27 NaN 18 ... 18 Borna disease virus 1 (BoDV-1) Bornaviridae Orthobornavirus Mammalian 1 orthobornavirus 1 1714621 NaN NaN ILQSLLLGVFGTGIASASQFLRGWLNHPDIIGYIVNGVGVVW
253 334 HA I NaN HA2 subunit NaN NaN NaN NaN 316 ... 316 Infectious salmon anemia virus (isolate Atlant... Orthomyxoviridae Isavirus Salmon isavirus isolate 810/9/99 652965 NaN NaN GIFGAIAGFIENGWEGMVDGWYGFRHQNSEGTGQAADL
270 559 Pre-glycoprotein polyprotein GP complex I low pH Glycoprotein G2 252-485 AFFSWSLTDSSGKDTPGGYCLEEWMLVAAKMKCFGNTAVAKCNLNH... P26313 NaN 558 ... 558 Jembrana disease virus (JDV) Retroviridae Lentivirus Jembrana disease virus NaN 36370 NaN NaN AFFSWSLTDSSGKDTPGGYCLEEWMLVAAKMKCFGNTAV
280 570 Pre-glycoprotein polyprotein GP complex I low pH GP2 260-491 GTFTWTLSDSEGKDTPGGYCLTRWMLIEAELKCFGNTAVAKCNEKH... P08669 NaN 570 ... 570 Lassa virus Arenaviridae Mammarenavirus Lassa mammarenavirus strain Josiah 11622 GTFTWTLSDSEGKDTPGGYCLTRWMLIEAELKCFGNTAV According to https://doi.org/10.1099/vir.0.829... GTFTWTLSDSEGKDTPGGYCLTRWMLIEAELKCFGNTAV
281 571 Pre-glycoprotein polyprotein GP complex I low pH GP2 259-490 GTFTWTLSDSEGNETPGGYCLTRWMLIEAELKCFGNTAVAKCNEKH... P17332 NaN 571 ... 571 Lassa virus Arenaviridae Mammarenavirus Lassa mammarenavirus GA391 11621 GTFTWTLSDSEGNETPGGYCLTRWMLIEAELKCFGNTAV Same as prevois, but from a different strain GTFTWTLSDSEGNETPGGYCLTRWMLIEAELKCFGNTAV
283 574 Pre-glycoprotein polyprotein GP complex I low pH GP2 266-498 GTFTWTLSDSSGVENPGGYCLTKWMILAAELKCFGNTAVAKCNVNH... P07399 NaN 574 ... 574 Lymphocytic choriomeningitis virus Arenaviridae Mammarenavirus Lymphocytic choriomeningitis mammarenavirus (strain WE) 11627 GTFTWTLSDSSGVENPGGYCLTKWMILAAELKCFGNTAV Same as Lassa GTFTWTLSDSSGVENPGGYCLTKWMILAAELKCFGNTAV
284 576 glycoprotein precursor I low pH GP2 496 MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQF... NaN AAS77879.1 576 ... 576 Machupo virus Arenaviridae Mammarenavirus Machupo mammarenavirus NaN 11628 AFFSWSLTDSSGKDMPGGYCLEEWMLIAAKMKCFGNTAV Same as Lassa AFFSWSLTDSSGKDMPGGYCLEEWMLIAAKMKCFGNTAV
330 659 Glycoprotein G NaN low pH NaN 20-524 KFPIYTIPDKLGPWSPIDIHHLSCPNNLVVEDEGCTNLSGFSYMEL... P08667 NaN 659 ... 659 Rabies virus Rhabdoviridae Lyssavirus Rabies lyssavirus strain Pasteur vaccin 103929 NaN NaN NPYPDYHWLRTVKTTKESLVIISPSVADLDPYDRSLHSRVFPGGNCS

9 rows × 26 columns

In [10]:
vfp_curated[vfp_curated['seq_vfp'].str.len()>40]
Out[10]:
idProtein Name Class Activation Name_Fusogenic_Unit Location_Fusogenic Sequence_fusogenic UniProtID NcbiID idTaxonomy ... idTaxonomy.1 CommonName Family Genre Species SubSpecies NcbiTax FP corrected DL Notes DL seq_vfp
13 17 Envelope glycoprotein p57 II ph Envelope glycoprotein p29 250-503 DTQQIEYLVHKLRPTLKDAWEDCEILQSLLLGVFGTGIASASQFLR... P52638 NaN 17 ... 17 Borna disease virus (strain V) (BDV) Bornaviridae NaN Borna disease virus strain V 928296 NaN NaN ILQSLLLGVFGTGIASASQFLRSWLNHPDIIGYIVNGVGVVW
14 18 Envelope glycoprotein p57 II ph Envelope glycoprotein p29 250-503 DTQQIEYLVHKLRPTLKDAWEDCEILQSLLLGVFGTGIASASQFLR... Q8BB27 NaN 18 ... 18 Borna disease virus 1 (BoDV-1) Bornaviridae Orthobornavirus Mammalian 1 orthobornavirus 1 1714621 NaN NaN ILQSLLLGVFGTGIASASQFLRGWLNHPDIIGYIVNGVGVVW
330 659 Glycoprotein G NaN low pH NaN 20-524 KFPIYTIPDKLGPWSPIDIHHLSCPNNLVVEDEGCTNLSGFSYMEL... P08667 NaN 659 ... 659 Rabies virus Rhabdoviridae Lyssavirus Rabies lyssavirus strain Pasteur vaccin 103929 NaN NaN NPYPDYHWLRTVKTTKESLVIISPSVADLDPYDRSLHSRVFPGGNCS

3 rows × 26 columns

T not take to much in consideration the end and begining of sequences, all the fusion peptides were kept even if they exceed 35 aminoacids.

With or without suplicates the distribution is the same (however in lower frequencies). The fusion peptides are distributed around the 20 and 25 aminoacids. The fusion porteins are distributed around the 300 and 400 aminoacids.

Distribution across fusion peptide classes and family virus¶

In [11]:
counts = vfp_curated['Class'].value_counts()
print(counts)

counts_uni = vfp_uni['Class'].value_counts()
print(counts_uni)

fig, axs = plt.subplots(1,2, figsize=(20, 15))

plt.subplot(121)
plt.pie(counts, labels = counts.index)
plt.title('distribution of class w/ dup')

plt.subplot(122)
plt.pie(counts_uni, labels=counts.index)
plt.title('distribution of class w/o dup')
plt.legend()
plt.show()
I      261
II      78
III     27
Name: Class, dtype: int64
I      142
II      48
III      8
Name: Class, dtype: int64
No description has been provided for this image
In [12]:
counts = vfp_curated['Family'].value_counts()
print(counts)
print('#######\n \n \n')
counts_uni = vfp_uni['Family'].value_counts()
print(counts_uni)


fig, axs = plt.subplots(1,2, figsize=(20, 20))

plt.subplot(121)
plt.pie(counts, labels = counts.index)
plt.title('distribution of class w/ dup')

plt.subplot(122)
plt.pie(counts_uni, labels=counts.index) # , textprops={'fontsize': 10}
plt.title('distribution of class w/o dup')
# plt.legend(prop = {'size' : 10})
plt.subplots_adjust( hspace=0.2, wspace=0.5)

plt.show()
Retroviridae        124
Flaviviridae         69
Paramyxoviridae      36
Retroviridae         30
Togaviridae          27
Orthomyxoviridae     27
Herpesviridae        24
Filoviridae          15
Coronaviridae        14
Paramyxoviridae      13
Pneumoviridae         8
Arenaviridae          4
Peribunyaviridae      2
Bornaviridae          2
Bunyaviridae          2
Rhabdoviridae         2
Nairoviridae          1
Hepadnaviridae        1
Pneumoviridae         1
Phenuiviridae         1
Name: Family, dtype: int64
#######
 
 

Retroviridae        70
Flaviviridae        43
Paramyxoviridae     19
Retroviridae        13
Orthomyxoviridae    13
Coronaviridae       11
Paramyxoviridae      8
Togaviridae          8
Filoviridae          7
Pneumoviridae        5
Herpesviridae        5
Arenaviridae         4
Bornaviridae         2
Rhabdoviridae        2
Nairoviridae         1
Bunyaviridae         1
Hepadnaviridae       1
Peribunyaviridae     1
Pneumoviridae        1
Phenuiviridae        1
Name: Family, dtype: int64
No description has been provided for this image

seqs that only differ in one, 2 or 3 character¶

In [13]:
different_fp = vfp_curated['seq_vfp'].unique()
print(len(different_fp))
different_fp = list(different_fp)


from Levenshtein import distance 
import copy 
def match(s1, s2):
    return distance(s1, s2) #<= 1


def get_dif(different_fp, dist_accepted):
    ls = []
    ls_peptides  = copy.deepcopy(different_fp)
    
    for peptide_1 in ls_peptides:
        count = -1 # to not cunt the peptide itself. 
#         ls_peptides.remove(peptide_1)
        for peptide_2 in ls_peptides:
            dist = match(peptide_1,peptide_2)
            if dist <dist_accepted+1: # aceita 2 mudancas - 3.  se aceita uma mudanca 2  
                count += 1
#                 ls_peptides.remove(peptide_2)
            else:
                pass
        ls.append(count)
    return ls 

ls_dif2 = get_dif(different_fp, dist_accepted = 2) # with 2 positions changed
ls_dif1 = get_dif(different_fp, dist_accepted = 1) # with just one position changed
ls_dif3 = get_dif(different_fp, dist_accepted = 3) # with 3 position changed


print('\n number of fpep with one difference in aminoacid referent to the fpep in that list positon. \n ')
print(sorted(ls_dif1, reverse=True))


print('\n number of fpep with2 difference in aminoacid referent to the fpep in that list positon. \n')
print(sorted(ls_dif2, reverse=True))


print('\n number of fpep with 3 difference in aminoacid referent to the fpep in that list positon. \n ')
print(sorted(ls_dif3, reverse=True))
    
216

 number of fpep with one difference in aminoacid referent to the fpep in that list positon. 
 
[5, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

 number of fpep with2 difference in aminoacid referent to the fpep in that list positon. 

[13, 12, 10, 9, 8, 7, 7, 7, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

 number of fpep with 3 difference in aminoacid referent to the fpep in that list positon. 
 
[21, 19, 18, 18, 17, 15, 15, 15, 15, 14, 13, 13, 11, 9, 9, 9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

A simple function to have an overview of the number of fusion peptides that may be very similar to one another.

The function grabs a sequence and checks in the list of non duplicates if they have just one difference from the other in the list. It counts all of this sequences (this may be oversimplist) . The printed lists are sorted (not by order of dataframe).

To have an idea, there is a sequence with 5 sequences in the dataset with just one difference in aminoacid. 2 sequences with 4 sequences with one aa changed and 9 sequences with 3 similar seqs. (as the sequences are not deleted, some of these may belong to the same group)

The re are one sequence with 13 sequences with 2 aa difference.

If we go to 3 aas difference, there is one sequence with 21 sequences just differing in 3 aas.

The similarity of sequences is a thing to have into account. The next step is calcualting the bilogical similarity and aggregate this sequences using the CDHIT.

In [14]:
# comapre theses sequences

similarity CDHIT¶

In [15]:
#   cd-hit -i nr -o nr100 -c 1.00 -n 5 -M 16000 –d 0 -T 8
#   cd-hit -i db -o db90 -c 0.9 -n 5 -M 16000 –d 0 -T 8
#   where
#   db is the filename of input,
#   db90 is output, 
#   -c 1.0, means 100% identity, is the clustering threshold
#   -c 0.9, means 90% identity, is the clustering threshold
#   -n 5 is the word size
#   -d 0 use sequence name in fasta header till the first white space
#   -M 16000, to use 16GB RAM
#   -T 8, to use 8 threads
# Choose of word size:
# -n 5 for thresholds 0.7 ~ 1.0
# -n 4 for thresholds 0.6 ~ 0.7
# -n 3 for thresholds 0.5 ~ 0.6
# -n 2 for thresholds 0.4 ~ 0.5


# protein dataset in fasta format 
# output are two files: a fasta file of representative sequences and a text file of list of clusters.


print(len(vfp_curated))
# make fasta
from IO import make_fasta
make_fasta(df = vfp_curated , filename_fasta = 'fasta_curated.fasta' ,seq_column = 'seq_vfp')


# clustering 90 % similarity 
!cd-hit -i fasta_curated.fasta -o results_cdhit90 -c 0.9 -n 5 -M 16000 -d 0 -T 8
print('finished')


from IO import read_fasta
# retrieves a fasta file with representative sequences 
seq100 = read_fasta('fasta_curated.fasta', th = 100)
seq90 = read_fasta('results_cdhit90', th=90)
403
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Apr 07 2021, 10:57:21
Command: cd-hit -i fasta_curated.fasta -o results_cdhit90 -c
         0.9 -n 5 -M 16000 -d 0 -T 8

Started: Sat Dec 24 11:21:27 2022
================================================================
                            Output                              
----------------------------------------------------------------
Warning: from file "fasta_curated.fasta",
Discarding invalid sequence or sequence without identifier and description!

>nan_G
[FRWYGPKY CGYATVT]
total seq: 402
longest and shortest : 47 and 13
Total letters: 9063
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 8 X 10M = 84M
Table           : 2 X 65M = 130M
Miscellaneous   : 0M
Total           : 214M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 1973149542

# comparing sequences from          0  to        402
---------- new table with      120 representatives

      402  finished        120  clusters

Approximated maximum memory consumption: 214M
writing new database
writing clustering information
program completed !

Total CPU time 0.15
finished
number of sequences with 100 % threshold: 403
number of sequences with 90 % threshold: 120
In [16]:
# and a cluster file
seq90 = read_fasta('results_cdhit90.clstr', th=90)

# >Cluster 0
# 0	458aa, >P0DTC2_Spike... *
# >Cluster 1
# 0	91aa, >D3IZQ2_Viral... *
# >Cluster 2
number of sequences with 90 % threshold: 120
In [17]:
# clustering 80 % similarity 
!cd-hit -i fasta_curated.fasta -o results_cdhit80 -c 0.8 -n 5 -M 16000 -d 0 -T 8
print('finished')

# clustering 50 % similarity 
!cd-hit -i fasta_curated.fasta -o results_cdhit50 -c 0.5 -n 3 -M 16000 -d 0 -T 8 # for thresholds lower word size must be increased
print('finished')

from IO import read_fasta
# retrieves a fasta file with representative sequences 
seq100 = read_fasta('fasta_curated.fasta', th = 100)
seq80 = read_fasta('results_cdhit80', th=80)
seq50 = read_fasta('results_cdhit50', th=50)
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Apr 07 2021, 10:57:21
Command: cd-hit -i fasta_curated.fasta -o results_cdhit80 -c
         0.8 -n 5 -M 16000 -d 0 -T 8

Started: Sat Dec 24 11:21:27 2022
================================================================
                            Output                              
----------------------------------------------------------------
Warning: from file "fasta_curated.fasta",
Discarding invalid sequence or sequence without identifier and description!

>nan_G
[FRWYGPKY CGYATVT]
total seq: 402
longest and shortest : 47 and 13
Total letters: 9063
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 8 X 10M = 84M
Table           : 2 X 65M = 130M
Miscellaneous   : 0M
Total           : 214M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 1973149542

# comparing sequences from          0  to        402
---------- new table with       91 representatives

      402  finished         91  clusters

Approximated maximum memory consumption: 214M
writing new database
writing clustering information
program completed !

Total CPU time 0.16
finished
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Apr 07 2021, 10:57:21
Command: cd-hit -i fasta_curated.fasta -o results_cdhit50 -c
         0.5 -n 3 -M 16000 -d 0 -T 8

Started: Sat Dec 24 11:21:28 2022
================================================================
                            Output                              
----------------------------------------------------------------
Warning: from file "fasta_curated.fasta",
Discarding invalid sequence or sequence without identifier and description!

>nan_G
[FRWYGPKY CGYATVT]
total seq: 402
longest and shortest : 47 and 13
Total letters: 9063
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 0M
Buffer          : 8 X 10M = 84M
Table           : 2 X 0M = 0M
Miscellaneous   : 0M
Total           : 84M

Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 1989448902

# comparing sequences from          0  to        402
---------- new table with       40 representatives

      402  finished         40  clusters

Approximated maximum memory consumption: 84M
writing new database
writing clustering information
program completed !

Total CPU time 0.08
finished
number of sequences with 100 % threshold: 403
number of sequences with 80 % threshold: 91
number of sequences with 50 % threshold: 40

From 403 sequences, 216 are repeating sequences. From this, using a cutoff of 90 % similarity (and word size 5) 120 sequences clusters with less than 90 % similarity. we have 81 sequences with less than 80 % similarity and 40 sequences with 50 % similarity (the word size here is 3)

WEBLOGOS¶

Create protein logos (with motif from Biopython and logomaker) for sequences with different levels of similarity.

In [19]:
from Bio import motifs
import logomaker
def create_log(list_seqs):
    maxlen = len(max(list_seqs, key = len)) 
    seq_pad = [str(seq).ljust(maxlen, 'X') for seq in list_seqs if '['not in str(seq)]
    
    m = motifs.create(seq_pad, alphabet= 'ACDEFGHIKLMNPQRSTVWYX')

    # create Logo object
    ww_logo = logomaker.Logo(pd.DataFrame(m.counts),
                             color_scheme='NajafabadiEtAl2017',
                             vpad=.1,
                             width=.8)

    print(m.consensus)
          

          
create_log(list_seqs = seq100)
# create_log(list_seqs = seq90)
create_log(list_seqs = seq80)
create_log(list_seqs = seq50)
    
 Warning: Character 'X' is not in color_dict. Using black.
GVGAAAGLGALGGAAAATXGAXXXXXXXXXXXXXXXXXXXXXXXXXX
 Warning: Character 'X' is not in color_dict. Using black.
GVGLAIALGALGVGAAAGAGXXXXXXXXXXXXXXXXXXXXXXXXXXX
 Warning: Character 'X' is not in color_dict. Using black.
GLGLLLLLGGALAGAAASLAXXXXXXXXXXXXXXXXXXXXXXXXXXX
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Create protein logos (with motif from Biopython and logomaker) for sequences from different families.

In [20]:
for family in vfp_curated['Family'].unique():
    vfp_fam = vfp_uni.loc[vfp_uni['Family']==family] # using not the duplicates
    list_seqs = vfp_fam['seq_vfp']
    print(family)
    create_log(list_seqs)
Retroviridae
 Warning: Character 'X' is not in color_dict. Using black.
AVGLLGFLGGLLGAAGGTMGAXXXXXXXXXXXXXXXXXX
Pneumoviridae
 Warning: Character 'X' is not in color_dict. Using black.
FVEGAIALGVATAAAVTAGVAIXXXXXXX
Retroviridae 
 Warning: Character 'X' is not in color_dict. Using black.
VMLALATLLSGAAAGTGATAIXX
Coronaviridae
 Warning: Character 'X' is not in color_dict. Using black.
SAATGAIADFGWFAFAXSXXXXXXX
Bornaviridae 
 Warning: Character 'X' is not in color_dict. Using black.
ILQSLLLGVFGTGIASASQFLRGWLNHPDIIGYIVNGVGVVW
Paramyxoviridae
 Warning: Character 'X' is not in color_dict. Using black.
FAGVVIGGAALGVATAAQITAGVALXXXXXX
Peribunyaviridae
 Warning: Character 'X' is not in color_dict. Using black.
WGCEEFGCLAVSDGCVFGSCQD
Togaviridae
 Warning: Character 'X' is not in color_dict. Using black.
VYPFMWGGAYCFCDSENTXXX
Nairoviridae
 Warning: Character 'X' is not in color_dict. Using black.
WRCNPTWCWGVGTGCTCCGLDVK
Flaviviridae
 Warning: Character 'X' is not in color_dict. Using black.
LVGAATLCSALYVGDLCGSVFLVGQLFTFSPR
Filoviridae
 Warning: Character 'X' is not in color_dict. Using black.
GAAAGLAWIPYFGPAAXXXXX
Herpesviridae
 Warning: Character 'X' is not in color_dict. Using black.
PDFFRLAFLLAVSGFAFVNAA
Bunyaviridae
 Warning: Character 'X' is not in color_dict. Using black.
WGCNPSDCPGVGTGCTACGLYLD
Hepadnaviridae
 Warning: Character 'X' is not in color_dict. Using black.
MENITSGFLGPLLVLQAGFFLLTR
Orthomyxoviridae
 Warning: Character 'X' is not in color_dict. Using black.
GLFGAIAGFIENGWEGMIDGWYGFRXXXXXXXXXXXXX
Arenaviridae
 Warning: Character 'X' is not in color_dict. Using black.
GTFTWTLSDSEGKDTPGGYCLTRWMLIAAELKCFGNTAV
Pneumoviridae 
 Warning: Character 'X' is not in color_dict. Using black.
FLGLILGLGAAVTAGVALAKT
Paramyxoviridae 
 Warning: Character 'X' is not in color_dict. Using black.
FIGAIIGGVALGVATAAQITAASAL
Rhabdoviridae
 Warning: Character 'X' is not in color_dict. Using black.
NPYPDYHWLRTVKTTKESLVIISPSVADLDPYDRSLHSRVFPGGNCS
Phenuiviridae
 Warning: Character 'X' is not in color_dict. Using black.
WGCGCFNVNPSCLFVHT
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Create protein logos (with motif from Biopython and logomaker) for sequences from different clusters.

In [23]:
cluster90 = 'results_cdhit50.clstr'
fasta_or = 'fasta_curated.fasta'

# >Cluster 0
# 0	47aa, >P08667_Glycoprotein... *
# >Cluster 1
# 0	42aa, >P52638_Envelope... *
# 1	42aa, >Q8BB27_Envelope... at 97.62%
# >Cluster 2
# 0	39aa, >P26313_Pre-glycoprotein... *
# 1	39aa, >nan_glycoprotein... at 94.87%

# >P03386_Envelope glycoprotein_I_0
# VSLTLALLLGGLTMGGIAAGV
# >P03397_Envelope glycoprotein gp95_I_1
# GPTARIFASILAPGVAAAQAL

# transform fasta_or in dataframe 
from Bio import SeqIO

dict_or = dict()
with open(fasta_or) as fasta_file:
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        dict_or[seq_record.id] = seq_record.seq


import re
def read_cluster_file(cl_file, dict_or):
    dict_cluster = dict()
    key_list = []
    cluster_name = 'None'
    with open(cl_file) as cluster_file: 
        for line in cluster_file:
           
            if line.startswith('>'):
                dict_cluster[cluster_name] = key_list # from previous 
                # reset the name and the list 
                cluster_name = line # Cluster 0
                key_list = []
            
            else:
                m = re.search('>(.+?)\\.', line)
                name = m.group(1)
#                 name = 'between Z and ...' #>nan_HA_I_247...
                key_list.append(dict_or[name])
    dict_cluster.pop('None')
    return dict_cluster
clusters = read_cluster_file(cl_file=cluster90, dict_or=dict_or)
for list_seqs in clusters.values():
    create_log(list_seqs)
 Warning: Character 'X' is not in color_dict. Using black.
NPYPDYHWLRTVKTTKESLVIISPSVADLDPYDRSLHSRVFPGGNCS
 Warning: Character 'X' is not in color_dict. Using black.
ILQSLLLGVFGTGIASASQFLRGWLNHPDIIGYIVNGVGVVW
 Warning: Character 'X' is not in color_dict. Using black.
GTFTWTLSDSSGKDTPGGYCLTEWMLIAAELKCFGNTAV
 Warning: Character 'X' is not in color_dict. Using black.
GLFGAIAGFIEGGWEGMIDGWYGFRXXXXXXXXXXXXX
 Warning: Character 'X' is not in color_dict. Using black.
LVGAATLCSALYVGDLCGSVFLVGQLFTFSPR
 Warning: Character 'X' is not in color_dict. Using black.
FAGVVIGGAALGVATAAQITAAIALXXXXXX
 Warning: Character 'X' is not in color_dict. Using black.
NEESYYHIAARIATSIFALSEMGRTTEYF
 Warning: Character 'X' is not in color_dict. Using black.
SSLLGSIAGAGWTAGLSSFAAIPFA
 Warning: Character 'X' is not in color_dict. Using black.
MENITSGFLGPLLVLQAGFFLLTR
 Warning: Character 'X' is not in color_dict. Using black.
WGCNPSDCPGVGTGCTACGLYLD
 Warning: Character 'X' is not in color_dict. Using black.
NLRRIQEAGLGLANAITTVAKIS
 Warning: Character 'X' is not in color_dict. Using black.
NYAKLRSMGYALTGAVQTLSQIS
 Warning: Character 'X' is not in color_dict. Using black.
WGCEEFGCLAVSDGCVFGSCQD
 Warning: Character 'X' is not in color_dict. Using black.
AVGVGGFFLGFLGAAGSTMGAX
 Warning: Character 'X' is not in color_dict. Using black.
VSLTLALLLGGLTMGGIAAGV
 Warning: Character 'X' is not in color_dict. Using black.
GPTARIFASILAPGVAAAQAL
 Warning: Character 'X' is not in color_dict. Using black.
FIPLLVGLGITTAVSTGTTGL
 Warning: Character 'X' is not in color_dict. Using black.
AVGLAIFLLVLAIMAITASLT
 Warning: Character 'X' is not in color_dict. Using black.
FLGFLLGVGSAIASGVAVSKV
 Warning: Character 'X' is not in color_dict. Using black.
GIGLVIVLAIMAIIAAAGAGL
 Warning: Character 'X' is not in color_dict. Using black.
/home/martinha/miniconda3/envs/viral_fp/lib/python3.9/site-packages/logomaker/src/Logo.py:194: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  fig, ax = plt.subplots(1, 1, figsize=self.figsize)
TETLTTMFEVSVAFFKVGHAV
 Warning: Character 'X' is not in color_dict. Using black.
FGISAIVAAIVAATAIAASAT
 Warning: Character 'X' is not in color_dict. Using black.
VMLALATVLSMAGAGTGATAI
 Warning: Character 'X' is not in color_dict. Using black.
LAATSASLFPPWTAAAGVPFY
 Warning: Character 'X' is not in color_dict. Using black.
IGGIALGGLTSAVSIPFSLAI
 Warning: Character 'X' is not in color_dict. Using black.
PLFWRLTGLLATSGFAFVNAA
 Warning: Character 'X' is not in color_dict. Using black.
KDTFHLMFLFGLTHFLYSTRG
 Warning: Character 'X' is not in color_dict. Using black.
AVPVAVWLVSALAMGAGVAGG
 Warning: Character 'X' is not in color_dict. Using black.
GAAWGLAWIPYFGPAAXXXXX
 Warning: Character 'X' is not in color_dict. Using black.
FVAAIILGISALIAIITSFAV
 Warning: Character 'X' is not in color_dict. Using black.
VYPFMWGGAYCFCDTENTXXX
 Warning: Character 'X' is not in color_dict. Using black.
RRTVEMAFAYALALFAAARQ
 Warning: Character 'X' is not in color_dict. Using black.
PDIFLVLFQMLVAHFLVARG
 Warning: Character 'X' is not in color_dict. Using black.
MYKTPAIKDFGGFNFSQIL
 Warning: Character 'X' is not in color_dict. Using black.
GIPQMMFLYGIVHFSYSTK
 Warning: Character 'X' is not in color_dict. Using black.
GTEVSEALGGAGLTGGFY
 Warning: Character 'X' is not in color_dict. Using black.
RSARSAIEDLLFDKVXXX
 Warning: Character 'X' is not in color_dict. Using black.
WGCGCFNVNPSCLFVHT
 Warning: Character 'X' is not in color_dict. Using black.
DRGWGNGCGLFGKGXX
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image